In this report, we will analyse data about LEGO. All data comes from https://rebrickable.com/downloads/, which is a site that allows users to share their MOCs (My Own Creation - constructions created by LEGO fans).

We’ll start by looking at the data schema, the tables and the data they contain. We will see how the tables interconnect and understand what the values of the different attributes are. We will also learn about the size of the entire dataset. The next step will be an analysis on the data from individual or linked tables. We will start with minifigures, then move on to parts and finish the analysis on sets. During the analyses, we will wrestle with missing data. Last but not least we will try to predict the future using forecasting. There we will focus on all sets and a Star Wars set.

After the analysis, the most striking observation is that LEGO is growing all the time. Every year there are more sets and the number of parts increases. Some interesting insights can be reached by observing the data on themes. Many of them are well-known brands (Ben 10, Avatar, Angry Bird) that appear at the peak of the brand hype. In addition to this, there are many generic themes (Town, City) that have been with LEGO for a long time and are becoming more popular every year. It’s also worth to note that at the beginning LEGO didn’t cooperate with many external brands which is why we don’t see many sets of external brands at the beginning of the company’s operation.

Most of the plots are interactive so keep in mind that if you want to find out more about presented data just hover over it.

Libraries

library(ggplot2)
library(dplyr)
library(png)
library(grid)
library(plotly)
library(gganimate)
library(tidyr)
library(forecast)

Basic tables analysis

Data schema

We can clearly see that data is divided into 3 parts. Going from left to right, the first concerns parts and their color, second concerns minifigures and the last one is about sets. All parts have table with prefix inventory (for example: inventory_sets) which contains stock data. It is possible to connect data through inventories table.

Tables

Inventories

Table linking parts, minifigures and sets.
  • id: inventory id
  • version: inventory version
  • set_num: set id
id version set_num
1 1 7922-1
3 1 3931-1
4 1 6942-1
15 1 5158-1
16 1 903-1
17 1 850950-1
id version set_num
Min. : 1 Min. : 1.000 Length:37265
1st Qu.: 14424 1st Qu.: 1.000 Class :character
Median : 54379 Median : 1.000 Mode :character
Mean : 61104 Mean : 1.091 NA
3rd Qu.: 88842 3rd Qu.: 1.000 NA
Max. :194312 Max. :16.000 NA

Inventory sets

Sets stock
  • inventory_id: inventory id
  • set_num: set id
  • quantity: number of sets
inventory_id set_num quantity
35 75911-1 1
35 75912-1 1
39 75048-1 1
39 75053-1 1
50 4515-1 1
50 4520-1 2
id version set_num
Min. : 1 Min. : 1.000 Length:37265
1st Qu.: 14424 1st Qu.: 1.000 Class :character
Median : 54379 Median : 1.000 Mode :character
Mean : 61104 Mean : 1.091 NA
3rd Qu.: 88842 3rd Qu.: 1.000 NA
Max. :194312 Max. :16.000 NA

Sets

Data related to sets
  • set_num: set id
  • name: set name
  • year: when set was released
  • theme_id: theme id
  • num_parts: number of parts in set
set_num name year theme_id num_parts img_url
001-1 Gears 1965 1 43 https://cdn.rebrickable.com/media/sets/001-1.jpg
0011-2 Town Mini-Figures 1979 67 12 https://cdn.rebrickable.com/media/sets/0011-2.jpg
0011-3 Castle 2 for 1 Bonus Offer 1987 199 0 https://cdn.rebrickable.com/media/sets/0011-3.jpg
0012-1 Space Mini-Figures 1979 143 12 https://cdn.rebrickable.com/media/sets/0012-1.jpg
0013-1 Space Mini-Figures 1979 143 12 https://cdn.rebrickable.com/media/sets/0013-1.jpg
0014-1 Space Mini-Figures 1979 143 2 https://cdn.rebrickable.com/media/sets/0014-1.jpg
set_num name year theme_id num_parts img_url
Length:21880 Length:21880 Min. :1949 Min. : 1 Min. : 0.0 Length:21880
Class :character Class :character 1st Qu.:2001 1st Qu.:273 1st Qu.: 3.0 Class :character
Mode :character Mode :character Median :2012 Median :497 Median : 31.0 Mode :character
NA NA Mean :2008 Mean :442 Mean : 161.4 NA
NA NA 3rd Qu.:2018 3rd Qu.:608 3rd Qu.: 139.0 NA
NA NA Max. :2024 Max. :752 Max. :11695.0 NA

Themes

Data related to themes
  • id: theme id
  • name: theme name
  • parent_id: theme id of a parent theme (self-referential foreign key)
id name parent_id
1 Technic NA
3 Competition 1
4 Expert Builder 1
16 RoboRiders 1
17 Speed Slammers 1
18 Star Wars 1
id name parent_id
Min. : 1.0 Length:468 Min. : 1.0
1st Qu.:250.5 Class :character 1st Qu.:186.0
Median :466.0 Mode :character Median :411.0
Mean :433.5 NA Mean :360.6
3rd Qu.:625.2 NA 3rd Qu.:512.5
Max. :752.0 NA Max. :697.0
NA NA NA’s :145

Inventory Minifigs

Minifigures stock
  • inventory_id: inventory id
  • fig_num: minifigures id
  • quantity: number of minifigures
inventory_id fig_num quantity
3 fig-001549 1
4 fig-000764 1
19 fig-000555 1
25 fig-000574 1
26 fig-000842 1
26 fig-008641 1
inventory_id fig_num quantity
Min. : 3 Length:20858 Min. : 1.000
1st Qu.: 7869 Class :character 1st Qu.: 1.000
Median : 15681 Mode :character Median : 1.000
Mean : 43010 NA Mean : 1.062
3rd Qu.: 66834 NA 3rd Qu.: 1.000
Max. :194312 NA Max. :100.000

Minifigs

Data related to minifigures
  • fig_num: minifigures id
  • name: minifigure name
  • num_parts: number of parts in minifigure
fig_num name num_parts img_url
fig-000001 Toy Store Employee 4 https://cdn.rebrickable.com/media/sets/fig-000001.jpg
fig-000002 Customer Kid 4 https://cdn.rebrickable.com/media/sets/fig-000002.jpg
fig-000003 Assassin Droid, White 8 https://cdn.rebrickable.com/media/sets/fig-000003.jpg
fig-000004 Man, White Torso, Black Legs, Brown Hair 4 https://cdn.rebrickable.com/media/sets/fig-000004.jpg
fig-000005 Captain America with Short Legs 3 https://cdn.rebrickable.com/media/sets/fig-000005.jpg
fig-000006 Lloyd Avatar 5 https://cdn.rebrickable.com/media/sets/fig-000006.jpg
fig_num name num_parts img_url
Length:13764 Length:13764 Min. : 0.000 Length:13764
Class :character Class :character 1st Qu.: 4.000 Class :character
Mode :character Mode :character Median : 4.000 Mode :character
NA NA Mean : 5.296 NA
NA NA 3rd Qu.: 5.000 NA
NA NA Max. :156.000 NA

Inventory parts

Parts stock
  • inventory_id: inventory id
  • part_num: part id
  • color_id: color id
  • quantity: number of parts
  • is_spare: is part a spare part
inventory_id part_num color_id quantity is_spare img_url
1 48379c01 72 1 f https://cdn.rebrickable.com/media/parts/photos/1/48379c01-1-e7daa845-2671-4737-8642-3b1574308155.jpg
1 48395 7 1 f https://cdn.rebrickable.com/media/parts/photos/7/48395-7-b9152acf-2fa5-4836-a04d-5b7fd39c2406.jpg
1 stickerupn0077 9999 1 f
1 upn0342 0 1 f
1 upn0350 25 1 f
3 2343 47 1 f https://cdn.rebrickable.com/media/parts/elements/3000240.jpg
inventory_id part_num color_id quantity is_spare img_url
Min. : 1 Length:1180987 Min. : -1.0 Min. : 1.00 Length:1180987 Length:1180987
1st Qu.: 9404 Class :character 1st Qu.: 4.0 1st Qu.: 1.00 Class :character Class :character
Median : 22838 Mode :character Median : 15.0 Median : 2.00 Mode :character Mode :character
Mean : 50849 NA Mean : 131.8 Mean : 3.37 NA NA
3rd Qu.: 87088 NA 3rd Qu.: 71.0 3rd Qu.: 4.00 NA NA
Max. :194312 NA Max. :9999.0 Max. :3064.00 NA NA

Colors

Data related to elements color
  • id: color id
  • name: color name
  • rgb: color in rgb format
  • is_trans: is color a transparent color
id name rgb is_trans
-1 [Unknown] 0033B2 f
0 Black 05131D f
1 Blue 0055BF f
2 Green 237841 f
3 Dark Turquoise 008F9B f
4 Red C91A09 f
id name rgb is_trans
Min. : -1.0 Length:263 Length:263 Length:263
1st Qu.: 83.0 Class :character Class :character Class :character
Median :1005.0 Mode :character Mode :character Mode :character
Mean : 651.4 NA NA NA
3rd Qu.:1070.5 NA NA NA
Max. :9999.0 NA NA NA

Parts

Data related to parts
  • part_num: part id
  • name: part name
  • part_cat_id: part category id
part_num name part_cat_id part_material
003381 Sticker Sheet for Set 663-1 58 Plastic
003383 Sticker Sheet for Sets 618-1, 628-2 58 Plastic
003402 Sticker Sheet for Sets 310-3, 311-1, 312-3 58 Plastic
003429 Sticker Sheet for Set 1550-1 58 Plastic
003432 Sticker Sheet for Sets 357-1, 355-1, 940-1 58 Plastic
003434 Sticker Sheet for Set 575-2, 653-1, 460-1 58 Plastic
part_num name part_cat_id part_material
Length:52615 Length:52615 Min. : 1.00 Length:52615
Class :character Class :character 1st Qu.:17.00 Class :character
Mode :character Mode :character Median :41.00 Mode :character
NA NA Mean :38.91 NA
NA NA 3rd Qu.:60.00 NA
NA NA Max. :68.00 NA

Elements

Data related to part elements
  • element_id: element id
  • part_num: part id
  • color_id: color id
element_id part_num color_id design_id
6443403 2277c01pr0009 1 2277
6300211 67906c01 14 67908
4566309 2564 0 2564
4275423 53657 1004 53657
6194308 92926 71 28967
6229123 26561 4 26561
element_id part_num color_id design_id
Min. : 9327 Length:84138 Min. : -1.0 Min. : 1001
1st Qu.: 4259774 Class :character 1st Qu.: 8.0 1st Qu.: 18454
Median : 6057754 Mode :character Median : 28.0 Median : 41748
Mean : 5222065 NA Mean : 539.7 Mean : 45570
3rd Qu.: 6262024 NA 3rd Qu.: 135.0 3rd Qu.: 75475
Max. :61532443 NA Max. :9999.0 Max. :107520
NA NA NA NA’s :23682

Part Relationships

Data related to parts relationship
  • rel_type: type of relationship (rel_types are: (P)rint, Pai(R), Su(B)-Part, (M)old, Pa(T)tern, (A)lternate)
  • child_part_num: id of child part (foreign key to parts)
  • parent_part_num: id of parent part (foreign key to parts)
rel_type child_part_num parent_part_num
P 3626cpr3662 3626c
P 87079pr9974 87079
P 3960pr9971 3960
R 98653pr0003 98086pr0003
R 98653pr0003 98088pat0003
R 98653pr0003 98089pat0003
rel_type child_part_num parent_part_num
Length:29977 Length:29977 Length:29977
Class :character Class :character Class :character
Mode :character Mode :character Mode :character

Part categories

Data related to parts category
  • id: category id
  • name: category name
id name
1 Baseplates
3 Bricks Sloped
4 Duplo, Quatro and Primo
5 Bricks Special
6 Bricks Wedged
7 Containers
id name
Min. : 1.00 Length:66
1st Qu.:19.25 Class :character
Median :35.50 Mode :character
Mean :35.36 NA
3rd Qu.:51.75 NA
Max. :68.00 NA

Dataframes size

  • All rows in dataframes: 1446639
  • All columns in dataframes 45
  • All values in dataframes: 8099232

Analyses

Minifigures

Sum of rows with NA value: 0

Largest stock of minifigures

name quantity img_tag
Woman, Blue Torso with White Arms, White Legs 101
Battle Droid, One Bent Arm, One Straight Arm 96
Skeleton, Standard Face, Ball Joint Arms (3626b Head) 59
Classic Spaceman, White with Airtanks (3842a Helmet) 51
Classic Spaceman, Red with Airtanks (3842a Helmet) 50
Battle Droid, Two Bent Arms 50
Pit Crew, Red Torso, Red Legs, Ferrari 34
Man, Blue Shirt, Blue Legs, Red Hard Hat 30
Martian 29
Classic Spaceman, Yellow with Airtanks (3842b Helmet) 27
Steve 27

Parts

Sum of rows with NA value: 0

Material Type of Lego Parts

Log10(Number of Parts) is used instead of just displaying Number of Parts because difference between Plastic and other materials is so big that they would be almost invisible on plot.

Part categories

Sum of rows with NA value: 0

Parts with categories

Sum of rows with NA value: 0

Part of the table with parts and categories

part_num part_cat_id part_material name_categories
003381 58 Plastic Stickers
003383 58 Plastic Stickers
003402 58 Plastic Stickers
003429 58 Plastic Stickers
003432 58 Plastic Stickers
003434 58 Plastic Stickers

We can see that there are 8 main categories (Tiles Round and Curved - Tubes and Hoses)

Elements

Sum of rows with NA value: 23682
All NA values are in design_id column
element_id part_num color_id design_id
0 0 0 23682

In the data schema there is no indication of what the design_id is or where it is located, so we will remove the column from the dataframe

Parts + Categories with Elements

Sum of rows with NA value: 37700
18850 NA values are in element_id and color_id columns. This means that there are some parts that do not consist of elements
part_num part_cat_id part_material name_categories element_id color_id
0 0 0 0 18850 18850

Part of the table with parts + categories with elements

part_num part_cat_id part_material name_categories element_id color_id
003381 58 Plastic Stickers NA NA
003383 58 Plastic Stickers NA NA
003402 58 Plastic Stickers NA NA
003429 58 Plastic Stickers NA NA
003432 58 Plastic Stickers NA NA
003434 58 Plastic Stickers NA NA

Largest stock of parts

name quantity img_tag
Plate 1 x 2 127955
Plate Round 1 x 1 with Solid Stud 119278
Brick 1 x 2 99847
Plate 1 x 1 92323
Brick 1 x 1 75605
Technic Pin with Friction Ridges Lengthwise and Center Slots 68443
Plate 1 x 4 57805
Brick 2 x 2 55379
Slope 30° 1 x 1 x 2/3 (Cheese Slope) 50685
Tile Round 1 x 1 50410

Colors

Sum of rows with NA value: 0

Parts with Colors

part_num part_cat_id part_material name_categories element_id color_id name rgb is_trans
003381 58 Plastic Stickers NA NA NA NA NA
003383 58 Plastic Stickers NA NA NA NA NA
003402 58 Plastic Stickers NA NA NA NA NA
003429 58 Plastic Stickers NA NA NA NA NA
003432 58 Plastic Stickers NA NA NA NA NA
003434 58 Plastic Stickers NA NA NA NA NA
Sum of rows with NA value: 94250
In this case NA values mean that there are some parts that do not consist of elements
part_num part_cat_id part_material name_categories element_id color_id name rgb is_trans
0 0 0 0 18850 18850 18850 18850 18850

Parts without color are the parts that do not consist of elements

Themes

Themes sum of rows with NA value

Sum of rows with NA value: 145

All NA values are in parent_id column

id name parent_id
0 0 145

Parent themes collect other themes, which are their sub-themes. Part of parent themes names:

## [1] "Technic"       "Creator"       "Make & Create" "Town"         
## [5] "City"          "Racers"

Join parent themes with themes. Insert parent theme for their children.

id name_child name_parent
1 Technic Technic
3 Competition Technic
4 Expert Builder Technic
16 RoboRiders Technic
17 Speed Slammers Technic
18 Star Wars Technic

Sets

Sum of rows with NA value: 0

In sets dataframe there is data from 2024 and 2023. At the time the report was created, the data for these years was not yet complete so we will delete each observation from 2024 and 2023.

Histogram for Number of Parts in sets_df

We will remove some rows from sets_df which has really big Number of Parts. They diverge significantly from other values, making our graph unreadable.

set_num name year theme_id num_parts img_url
264 10179-1 Millennium Falcon 2007 171 5198 https://cdn.rebrickable.com/media/sets/10179-1.jpg
274 10189-1 Taj Mahal 2008 673 5922 https://cdn.rebrickable.com/media/sets/10189-1.jpg
350 10256-1 Taj Mahal 2017 673 5923 https://cdn.rebrickable.com/media/sets/10256-1.jpg
372 10276-1 Colosseum 2020 721 9036 https://cdn.rebrickable.com/media/sets/10276-1.jpg
382 10284-1 Camp Nou - FC Barcelona 2021 721 5509 https://cdn.rebrickable.com/media/sets/10284-1.jpg
391 10294-1 Titanic 2021 721 9092 https://cdn.rebrickable.com/media/sets/10294-1.jpg

Number of rows that we ignore: 19

Correlation between year and num_parts

Pearson correlation between year and Sum of Parts: 0.8061769.

The most interesting thing in heatmap below is really low correlation between median and sum_parts/count_sets. This means that altough the number of parts and sets is increasing the median does not share this trend.

Sets with Themes

Sum of rows with NA value: 0

Part of the table with sets and themes

year num_parts img_url name_child name_parent
1965 43 https://cdn.rebrickable.com/media/sets/001-1.jpg Technic Technic
1979 12 https://cdn.rebrickable.com/media/sets/0011-2.jpg Classic Town Town
1987 0 https://cdn.rebrickable.com/media/sets/0011-3.jpg Lion Knights Castle
1979 12 https://cdn.rebrickable.com/media/sets/0012-1.jpg Supplemental Space
1979 12 https://cdn.rebrickable.com/media/sets/0013-1.jpg Supplemental Space
1979 2 https://cdn.rebrickable.com/media/sets/0014-1.jpg Supplemental Space

We can see that most themes have less than 500 sets and less than 50.000 parts.

Largest stock of sets

name sum_quantity img_tag
Series 19 - Random Bag 110
Unikitty! Series 1 - Random Bag 60
Series 9 - Random Bag 60
Series 10 - Random Bag 60
Series 11 - Random Bag 60
The LEGO Movie Series 1 - Random Bag 60
The Simpsons Series 1 - Random Bag 60
Series 12 - Random Bag 60
Series 13 - Random Bag 60
The Simpsons Series 2 - Random Bag 60

Forecasting

Forecast Number of Sets

Model predicts that in 2025 there will be: 1127 sets and in 2030 there will be: 1201 sets.

Forecast Number of Parts in all Sets

Model predicts that in 2025 there will be: 394662 parts and in 2030 there will be: 507113 parts.

Forecast Number of Star Wars Sets

Model predicts that in 2025 there will be: 42 Star Wars sets and in 2030 there will be: 42 Star Wars sets.

Forecast Number of Parts in Star Wars Sets

Model predicts that in 2025 there will be: 24861 Star Wars parts and in 2030 there will be: 28797 Star Wars parts.

Github: https://github.com/C7A7A/lego_analysis